NSF PAR Search | NSF Public Access Repository


World of code: enabling a research workflow for mining and analyzing the universe of open source VCS data

https://doi.org/10.1007/s10664-020-09905-9

Ma, Yuxing; Dey, Tapajit; Bogart, Chris; Amreen, Sadika; Valiev, Marat; Tutko, Adam; Kennard, David; Zaretzki, Russell; Mockus, Audris (March 2021, Empirical Software Engineering)
Patterns of Effort Contribution and Demand and User Classification based on Participation Patterns in NPM Ecosystem

Dey, Tapajit; Ma, Yuxing; Mockus, Audris (September 2019, In Proceedings of the 15th International Conference on Predictive Models and Data Analytics in Software Engineering)

Background: Open source requires participation of volunteer and commercial developers (users) in order to deliver functional high-quality components. Developers both contribute effort in the form of patches and demand effort from the component maintainers to resolve issues reported against the components. Open source components depend on each other directly and transitively, and evidence suggests that more effort is required for reporting and resolving the issues reported further upstream in this supply chain. Aim: Identify and characterize patterns of effort contribution and demand throughout the open source supply chain and investigate if and how these patterns vary with developer activity; identify different groups of developers; and predict developers' company affiliation based on their participation patterns. Method: 1,376,946 issues and pull-requests created for 4,433 NPM packages with over 10,000 monthly downloads and full (public) commit activity data of the 272,142 issue creators are obtained and analyzed, and dependencies on NPM packages are identified. The fuzzy c-means clustering algorithm is used to find the groups among the users based on their effort contribution and demand patterns, and Random Forest is used as the predictive modeling technique to identify their company affiliations. Result: Users contribute and demand effort primarily from packages that they depend on directly, with only a tiny fraction of contributions and demand going to transitive dependencies. A significant portion of demand goes into packages outside the users' respective supply chains (constructed based on publicly visible version control data). Three and two different groups of users are observed based on the effort demand and effort contribution patterns, respectively. The Random Forest model used for identifying the company affiliation of the users gives an AUC-ROC value of 0.68, and variables representing aggregate participation patterns proved to be the important predictors.
Conclusion: Our results give new insights into effort demand and supply at different parts of the supply chain of the NPM ecosystem and its users and suggest the need to increase visibility further upstream.
Patterns of Effort Contribution and Demand and User Classification based on Participation Patterns in NPM Ecosystem

https://doi.org/10.1145/3345629.3345634

Dey, Tapajit; Ma, Yuxing; Mockus, Audris (August 2019, PROMISE'19: Proceedings of the Fifteenth International Conference on Predictive Models and Data Analytics in Software Engineering)

World of code: an infrastructure for mining the universe of open source VCS data

https://doi.org/10.1109/MSR.2019.00031

Ma, Yuxing; Bogart, Chris; Amreen, Sadika; Zaretzki, Russell; Mockus, Audris (July 2019, MSR '19 Proceedings of the 16th International Conference on Mining Software Repositories)

Open source software (OSS) is essential for modern society and, while substantial research has been done on individual (typically central) projects, only a limited understanding of the periphery of the entire OSS ecosystem exists. For example, how are tens of millions of projects in the periphery interconnected through technical dependencies, code sharing, or knowledge flows? To answer such questions we a) create a very large and frequently updated collection of version control data for FLOSS projects named World of Code (WoC) and b) provide basic tools for conducting research that depends on measuring interdependencies among all FLOSS projects. Our current WoC implementation is capable of being updated on a monthly basis and contains over 12B git objects. To evaluate its research potential and to create vignettes for its usage, we employ WoC in conducting several research tasks. In particular, we find that it is capable of supporting trend evaluation, ecosystem measurement, and the determination of package usage. We expect WoC to spur investigation into global properties of OSS development leading to increased resiliency of the entire OSS ecosystem. Our infrastructure facilitates the discovery of key technical dependencies, code flow, and social networks that provide the basis to determine the structure and evolution of the relationships that drive FLOSS activities and innovation.
World of code: an infrastructure for mining the universe of open source VCS data

https://doi.org/10.1109/MSR.2019.00031

Ma, Yuxing; Bogart, Christopher; Amreen, Sadika; Zaretzki, Russell; Mockus, Audris (May 2019, MSR '19: Proceedings of the 16th International Conference on Mining Software Repositories)

A Methodology for Measuring FLOSS Ecosystems

https://doi.org/10.1007/978-981-13-7099-1_1

Amreen, Sadika; Bichescu, Bogdan; Bradley, Randy; Dey, Tapajit; Ma, Yuxing; Mockus, Audris; Mousavi, Sara; Zaretzki, Russell (July 2019, Towards Engineering Free/Libre Open Source Software (FLOSS) Ecosystems for Impact and Sustainability)

The FLOSS ecosystem as a whole is a critical component of the world’s computing infrastructure, yet it is not well understood. In order to understand it well, we need to measure it first. We, therefore, aim to provide a framework for measuring key aspects of the entire FLOSS ecosystem. We first consider the FLOSS ecosystem through the lens of a supply chain. The concept of a supply chain rests on a series of interconnected parties/affiliates, each contributing unique elements and expertise so as to ensure a final solution is accessible to all interested parties. This perspective has been extremely successful in helping companies cope with multifaceted risks caused by the distributed decision-making in their supply chains, especially as they have become more global. Software ecosystems, similarly, represent distributed decisions in supply chains of code and author contributions, suggesting that relationships among projects, developers, and source code have to be measured. We then describe a massive measurement infrastructure involving discovery, extraction, cleaning, correction, and augmentation of publicly available open-source data from version control systems and other sources. We then illustrate how the key relationships among the nodes representing developers, projects, changes, and files can be accurately measured, how to handle the absence of measures for the user base in version control data, and, finally, how such measurement infrastructure can be used to increase knowledge resilience in FLOSS.
Protamine loops DNA in multiple steps

https://doi.org/10.1093/nar/gkaa365

Ukogu, Obinna A; Smith, Adam D; Devenica, Luka M; Bediako, Hilary; McMillan, Ryan B; Ma, Yuxing; Balaji, Ashwin; Schwab, Robert D; Anwar, Shahzad; Dasgupta, Moumita; et al (May 2020, Nucleic Acids Research)
Protamine proteins dramatically condense DNA in sperm to almost crystalline packing levels. Here, we measure the first step in the in vitro pathway, the folding of DNA into a single loop. Current models for DNA loop formation are one-step, all-or-nothing models with a looped state and an unlooped state. However, when we use a Tethered Particle Motion (TPM) assay to measure the dynamic, real-time looping of DNA by protamine, we observe the presence of multiple folded states that are long-lived (∼100 s) and reversible. In addition, we measure folding on DNA molecules that are too short to form loops. This suggests that protamine is using a multi-step process to loop the DNA rather than a one-step process. To visualize the DNA structures, we used an Atomic Force Microscopy (AFM) assay. We see that some folded DNA molecules are loops with a ∼10-nm radius and some of the folded molecules are partial loops (c-shapes or s-shapes) that have a radius of curvature of ∼10 nm. Further analysis of these structures suggests that protamine is bending the DNA to achieve this curvature rather than increasing the flexibility of the DNA. We therefore conclude that protamine loops DNA in multiple steps, bending it into a loop.